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Abstract 

Background: A conformational epitope (CE) in an antigentic protein is composed of amino acid residues that are 
spatially near each other on the antigen's surface but are separated in sequence; CEs bind their complementary 
paratopes in B-cell receptors and/or antibodies. CE predication is used during vaccine design and in immuno- 
biological experiments. Here, we develop a novel system, CE-KEG, which predicts CEs based on knowledge-based 
energy and geometrical neighboring residue contents. The workflow applied grid-based mathematical 
morphological algorithms to efficiently detect the surface atoms of the antigens. After extracting surface residues, 
we ranked CE candidate residues first according to their local average energy distributions. Then, the frequencies at 
which geometrically related neighboring residue combinations in the potential CEs occurred were incorporated 
into our workflow, and the weighted combinations of the average energies and neighboring residue frequencies 
were used to assess the sensitivity, accuracy, and efficiency of our prediction workflow. 

Results: We prepared a database containing 247 antigen structures and a second database containing the 163 
non-redundant antigen structures in the first database to test our workflow. Our predictive workflow performed 
better than did algorithms found in the literature in terms of accuracy and efficiency. For the non-redundant 
dataset tested, our workflow achieved an average of 47.8% sensitivity, 84.3% specificity, and 80.7% accuracy 
according to a 10-fold cross-validation mechanism, and the performance was evaluated under providing top three 
predicted CE candidates for each antigen. 

Conclusions: Our method combines an energy profile for surface residues with the frequency that each 
geometrically related amino acid residue pair occurs to identify possible CEs in antigens. This combination of these 
features facilitates improved identification for immuno-biological studies and synthetic vaccine design. CE-KEG is 
available at http://cekeg.cs.ntou.edu.tw. 
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Introduction 

A B-cell epitope, also known as an antigenic determi- 
nant, is the surface portion of an antigen that interacts 
with a B-cell receptor and/or an antibody to elicit either 
a cellular or humoral immune response [1,2]. Because of 
their diversity, B-cell epitopes have a huge potential for 
immunology-related applications, such as vaccine design 
and disease prevention, diagnosis, and treatment [3,4]. 
Although clinical and biological researchers usually 
depend on biochemical/biophysical experiments to iden- 
tify epitope-binding sites in B-cell receptors and/or anti- 
bodies, such work can be expensive, time-consuming, 
and not always successful. Therefore, in silico methods 
that can reliably predict B-cell epitopes would simplify 
immunology-related experiments [5]. Given accurate 
epitope-prediction tools, immunologists can then focus 
on the appropriate protein residues and reduce their 
experimental efforts. 

In general, epitopes are described as linear (continuous) 
or conformational (discontinuous) [6]. A linear epitope 
(LE) is a short, continuous sequence of amino acid resi- 
dues on the surface of an antigen. Although an isolated LE 
is usually flexible, which destroys any information con- 
cerning its conformation in the protein, it can adapt that 
conformation to react weakly with a complementary anti- 
body. Conversely, a conformational epitope (CE) is com- 
posed of residues that are not sequential but are near in 
space [7]. Several algorithms, which require a protein 
sequence as input, are available for LE prediction, includ- 
ing BEPITOPE [8], BCEPred [9], BepiPred [10], ABCpred 
[11], LEPS [12,13] and BCPreds [14]. These algorithms 
assess the physicochemical propensities, such as polarity, 
charge, or secondary structure, of the residues within the 
targeted protein sequence, and then apply quantitative 
matrices or machine-learning algorithms, such as the hid- 
den Markov model, a support vector machine algorithm, 
or an artificial neural network algorithm, to predict LEs. 
However, the number of LEs on native proteins has been 
estimated to be -10% of all B-cell epitopes, and most 
B-cell epitopes are CEs [15]. Therefore, to focus on the 
identification of CEs is the more practical and valuable 
task. For CE prediction, several algorithms have been 
developed including CEP [16], DiscoTope [17], PEPOP 
[18], ElliPro [19], PEPITO [20], and SEPPA [21], all of 
which use combinations of the physicochemical character- 
istics of known epitope residues and trained statistical fea- 
tures of known antigen- antibody complexes to identify CE 
candidates. 

A different approach relies on phage display to produce 
peptide mimotopes that can be used to characterize the 
relationship between an epitope and a B-cell receptor or 
an antibody. Peptide mimotopes bind B-cell receptors 
and antibodies in a manner similar to those of their 



corresponding epitopes. LEs and CEs can be identified by 
mimotope phage display experiments. MIMOP is a hybrid 
computational tool that predicts epitopes from information 
garnered from mimotope peptide sequences [22]. Similarly, 
Mapitope and Pep-3D-Search use mimotope sequences to 
search linear sequences for matching patterns of structures 
on antigen surfaces. Other algorithms can identify CE resi- 
dues with the use of the Ant Colony Optimization algo- 
rithm and statistical threshold parameters based on 
nonsequential residue pair frequencies [23,24]. 

Crystal and solution structures of the interfaces of 
antigen-antibody complexes characterize the binding 
specificities of the proteins in terms of hydrogen bond 
formation, van der Walls contacts, hydrophobicity and 
electrostatic interactions (reviewed by [25]). Only a 
small number residues located within the antigen-anti- 
body interface energetically contribute to the binding 
affinity, which defines these residues as the "true" anti- 
genic epitope [26]. Hence, we hypothesized that the 
energetically important residues in epitopes could be 
identified in silico. We assumed that the free, overall 
native antigen structure is the lowest free energy state, 
but that residues involving in antibody binding would 
possess higher potential energies. Two types of potential 
energy functions are currently used for energy calcula- 
tions involving proteins: a physical-based potential func- 
tion that focuses on the fundamental forces between 
atoms, and a knowledge-based potential that relies on 
parameters derived from experimentally solved protein 
structures [27]. Owing to the heavy computational com- 
plexity required for the first approach, we adopted the 
knowledge-based potential for our workflow. The energy 
functions for the surface residues used are those of the 
Protein Structure Analysis website [28]. 

Additionally, a study concerning LE prediction [29] 
showed that certain sequential residue pairs occur more 
frequently in LE epitopes than in non-epitopes. A similar 
statistical feature may, therefore, enhance the performance 
of a CE prediction workflow. Hence, we incorporated the 
statistical distribution of geometrically related pairs of resi- 
dues found in verified CEs and the identification of resi- 
dues with relatively high energy profiles. We first located 
surface residues with relatively high knowledge-based 
energies within a specified radius of a sphere and assigned 
them as the initial anchors of candidate epitope regions. 
Then we extended the surfaces to include neighboring 
residues to define CE clusters. For this report, the distribu- 
tions of energies and combined with knowledge of geome- 
trically related pairs residues in true epitopes were 
analyzed and adopted as variables for CE prediction. The 
results of our developed system indicate that it provides 
an outstanding CE prediction with high specificity and 
accuracy. 
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Methods 

CE-KEG workflow architecture 

The proposed CE prediction system based on knowl- 
edge-based energy function and geometrical neighboring 
residue contents is abbreviated as "CE-KEG". CE-KEG is 
performed in four stages: analysis of a grid-based pro- 
tein surface, an energy-profile computation, anchor 
assignment, and CE clustering and ranking (Figure 1). 

The first module in the "Grid-based surface structure 
analysis" accepts a PDB file from the Research Collabora- 
tor for Structural Bioinformatics Protein Data Bank [30] 
and performs protein data sampling (structure discretiza- 
tion) to extract surface information. Subsequently, three- 
dimensional (3D) mathematical morphology computa- 
tions (dilation and erosion) are applied to extract the sol- 
vent accessible surface of the protein in the "Surface 
residue detection" submodule [31], and surface rates for 
atoms are calculated by evaluating the exposure ratio 
contacted by solvent molecules. Then, the surface rates 
of the side chain atoms of each residue are summed, 
expressed as the residue surface rate, and exported to a 
look-up table. The next module is "Energy profile com- 
putation" that uses calculations performed at the ProSA 
web system to rank the energies of each residue on the 
targeted antigen surface(s) [28]. Surface residues with 
greater energies and located at mutually exclusive 



positions are considered as the initial CE anchors. The 
third module is "Anchor assignment and CE clustering" 
which performs CE neighboring residue extensions using 
the initial CE anchors to retrieve neighboring residues 
according to energy indices and distances among anchor 
and extended residues. Additionally, the frequencies of 
occurrence of pair-wise amino acids are calculated to 
select suitable potential CE residue clusters. For the final 
module, "CE ranking and output result" the values of the 
knowledge-based energy propensities calculated in mod- 
ule 2 and the frequencies of occurrence of the geometri- 
cally related residue pairs are weighted and then 
combined to provide CE predictions. 

Preparation of test datasets 

The epitope data derived from the DiscoTope server, the 
Epitome database, and the Immune Epitope Database 
(IEDB) were collected to validate the performance of CE- 
KEG. Using DiscoTope, we obtained a benchmark data- 
set of 70 antigen-antibody complexes from the SACS 
database [32]. These complexes had been solved to at 
least 3-A resolution, and the antigens contained more 
than 25 residues. The epitope residues in this dataset 
were defined and chosen as those within 4 A of the resi- 
dues directly bound to the antibody (tied residues). The 
Epitome dataset contained 134 antigens which were 
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Figure 1 CE prediction workflow 
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inferred by the distances between the antigens and the 
complementary-determining of the corresponding anti- 
bodies, and these antigens were also successfully analyzed 
through ProSA's energy function evaluation. Epitome 
labels residues as interaction sites if an antigen atom is 
within 6 A of a complementary-determining antibody 
region. The IEDB dataset was initially composed of 56 
antigen chains acquired at the IEDB website (http://www. 
immuneepitope.org). This dataset contained only anti- 
gens for which the complex-structure annotation "Com- 
plexPdbld" was present in the "iedb_export" zip file. 
Because 11 of these antigens contained fewer than 35 
residues and 2 antigens could not be successfully ana- 
lyzed by ProSA, we only retained 43 antigen-antibody 
complexes in the final IEDB dataset. In brief, the total 
number of testing antigens from previous three resources 
is 247, and after removing duplicate antigens, a new test- 
ing dataset containing 163 non-redundant antigens is 
used for validation of CE-KEG. 

Surface structure analysis 

The interaction between an antigen and an antibody 
usually depends on their surface resides. The concepts of 
solvent accessible and molecular surfaces for proteins 
were first suggested by Lee and Richards [33] (Figure 2). 
Later, Richards introduced the molecular surface con- 
structs contact and re-entrant surfaces. The contact sur- 
face represents the part of the van der Waals surface that 
directly interacts with solvent. The re-entrant surface is 
defined by the inward-facing part of a spherical probe that 
touches more than one protein surface atom [34]. In 1983, 



Connolly employed the Gauss-Bonnet approach to calcu- 
late a molecular surface, which is defined by a small-sized 
probe that is rolled over a protein's surface [31]. On the 
basis of the definitions given above, we developed a grid- 
based algorithm that could efficiently identify surface 
regions of a protein. 

3D mathematical morphology operations 

Mathematical morphology was initially proposed as a 
rigorous theoretic framework for shape analysis of bin- 
ary images. Here, we employed the 3D mathematical 
morphological dilation and erosion operations for sur- 
face region calculations. Based on superior characteris- 
tics of morphology in terms of describing shape and 
structural characteristics, an efficient and effective algo- 
rithm was designed to detect precise surface rates for 
each residue. The query antigen structure was denoted 
as X as an object in a 3D grid: 

X= {v:/(v) = l,v= {x,y,z) e Z 3 }. 

where / is called as the characteristic function of X- 
On the other hand, the background X c is defined as: 

X c = {v:/(v) = 0,v= {x,y,z) e Z 3 }. 

A 1.5-A radius sphere is employed as a fundamental 
structure element B. The symmetric of B with respect to 
the origin (0, 0, 0) is denoted as B s and written as 

B s = {-v : v € B}. 
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The translation of B by vector d is denoted and 
performed as 

B d = {v + d:veB}. 

The three elementary morphological operators listed 
below are then applied for the surface region 
calculation. 

Dilation: X D = X 0 = {v e Z 3 : B ly nX ^ 0} 
Erosion: X E = X D Q B\ = {v e Z 3 : B 2v C X D } 
Difference: Xp — Xe 

where the X is the original structure, Xq is a dilated 
structure by the structuring element By Xe denotes the 
eroded structure from Xq by a larger structuring ele- 
ment B2 compared to By and the surface regions can be 
achieved by taking difference between Xq and Xe- The 
surface rate for each atom is obtained by calculating the 
ratio of the intersected and non-intersected regions with 
respect to the overlapping areas between the morpholo- 
gical difference operations and the original protein 
atoms. Figure 3 depicts the step-by-step procedure used 
to extract the surface regions and to calculate the sur- 
face rate for an atom. 



Surface rate computations 

The properties of the side chains of the residues in an 
epitope are important factors controlling protein-protein 
interactions. Much literature deals with the influence of 
side chains as factors affecting protein binding. Antigen- 
antibody binding may cause conformational changes in 
the proteins, and amino acids that have flexible side 
chains may, therefore, have an advantage. Experimen- 
tally, nonpolar-nonpolar and polar-polar side chain 
interactions stabilize protein interfaces [35]. Therefore, 
we considered side chain characteristics in our work- 
flow. With the use of 3D mathematical morphology 
operations, the rate of each atom, AR(r), can be deter- 
mined although only the rates of surface side-chain 
were considered. The surface rate of each residue is 
denoted SR(r) and calculated as: 

SR(r) = |ieR:i^AR(r)J 

where i represents the i th surface atom in the side chain 
of a residue, R is all surface atoms in a residue, and N is 
the total number of surface atoms in residue "r". 




4. Difference result ( 2 - 3 ) 5 ' atom surface rate »"»P«rt»«°n 

Figure 3 3D morphology operations used for surface rate calculations. Shown in the figure are the original, dilated, and eroded structures, 
the difference between the dilated and eroded structures, and the final atomic surface region. 
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Using the equation given directly above, statistics for the 
surface rates of verified epitope residues and of all surface 
residues in the non-redundant dataset were acquired, and 
their distributions are illustrated in Figure 4, which shows 
that the side chains of residues of known CEs often pos- 
sessed higher surface rates than do the averaged total 
areas of the antigens. After calculating the surface rates, 
they were imported into a file, and a minimum threshold 
value for the surface rate was set to be used in the predic- 
tive workflow. 

Energy profile computation 

We used the knowledge-based approach to calculate the 
energy of each surface residue [28], in conjunction with 
the distribution of pairwise distances to extract the effec- 
tive potentials between residues. The potential energy of 
each residue was calculated using a heavy-atom represen- 
tation, with the heavy atoms categorized according to the 
residue in which they were found. The potential calcula- 
tion represents the ratio between the observed and 
expected number of contacts for a pair of heavy atoms 
within a specified distance. The potential value for two 
atoms reflects the level of attractive interaction between 
the two residues. Although this knowledge-based potential 
has usually been used to improve fold recognition, and 
structure prediction and refinement, we adopted to calcu- 
late the energy of each surface residue so as to distinguish 
among active state conditions. To assess differences in the 
potentials of CE and non-CE residues, we calculated their 
surface energy profiles under a variety of parameter set- 
tings for 247 known antigens. We found that CE residues 
possess a higher energy function than do non-epitope 



residues. When the window size was set to eight residues, 
the average energy for each verified CE residue cluster in 
an antigen from the Epitome, DiscoTope, and IEDB data- 
sets was 69.4%, 82.9%, and 51.2% greater than the average 
energy of non-CE residues in the same antigen, respec- 
tively. We also observed that at least one CE residue in 
each antigen had an energy that was in the top 20% of all 
surface residues, and most of the largest energies for the 
CE residues ranked in the top 3%. Therefore, we selected 
the 20% of the residues with the greatest energies as our 
initial CE anchors. Additionally, the selected initial seeds 
were required to possess surface rates within the distribu- 
tion range of 20% to 50% shown in Figure 4. We also spe- 
cified that the anchor residues should be separated by at 
least 12-A to eliminate possible overlapping CE candi- 
dates. With the identities of the initial seeds decided, the 
relationship between geometrically related neighboring 
residues within a 10-A radius sphere of the anchor residue 
were examined. 

Frequency of occurrence of geometrically related 
residue pairs 

The filtering mechanism used was adopted from a sug- 
gestion by Chen that involves the use statistical features 
for CE verification [29]. However, unlike Chen's proposal 
that used pairs of sequential residues, CE-KEG incorpo- 
rated geometrically related neighboring residue pairs. 
Table 1 shows variables used for the statistical analysis of 
the residue pairs. Because there are 20 different amino 
acids, 210 possible unique combinations of pairs are pos- 
sible, for which we determined the number of times that 
they were found within CEs and non-CEs. Additionally, 



Surface Rate Statistics 
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Figure 4 The distribution of surface rates for residues in known CE epitopes and all surface residues in the antigen dataset 
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Table 1 Variables used in the statistical analysis of geometrically related amino acid pairs (GAAP). 



Variables 


Description 


GAAP 


The number of times a geometrically related residues pair occurs in the known CE epitope dataset. 


Ngaap 


The number of times a geometrically related amino acid pair occurs in the non-CE epitope dataset. 


fcAAP 


The frequency (%) that a geometrically related amino acid pair occurs in the known CE epitope dataset. 


fGAAP 


The frequency {%) that a geometrically related amino acid pair occurs in the non-CE epitope dataset. 


Total + GAAP 


The total number of times that all geometrical amino acid pairs occur in the known CE epitope dataset. 


Total GAAP 


The total number of times that all geometrical amino acid pairs occur in the non-CE epitope dataset. 


CEIgaap 


CEI for a geometrically related amino acid pair. 



the residue pairs found more frequently within spheres of 
various radii ranging from 2 A to 6 A were analyzed 
respectively, and their corresponding CE indices (CEIs) 
were also calculated for default settings. 

The CE Index (CEl GAAP ) was obtained by calculating 
the frequency of occurrence that a pair of geometrically 
related amino acid in the CE dataset divided by the fre- 
quency that the same pair in the non-CE epitope data- 
set. This value was converted into its logio value and 
then normalized. For example, the total number of all 
geometrically related residue pairs in the known CE epi- 
topes is 2843, and the total number of geometrically 
related pairs in non-CE epitopes is 36,118 when the 
pairs of residues were within a sphere of radius 2 A. 
The two greatest CEIs are for the residue pairs H/Q 
(0.921) and E/H (0.706) found in from the 247 antigens. 

After determining the CEI for each pair of residues, 
those for a predicted CE cluster were summed and 
divided by the number of CE pairs within the cluster to 
obtain the average CEI for a predicted CE patch. Finally, 
the average CEI was multiplied by a weighting factor and 
used in conjunction with a weighted energy function to 
obtain a final CE combined ranking index. On the basis 
of the averaged CEI, the prediction workflow provides 
the three highest ranked predicted CEs as the best candi- 
dates. An example of workflow is shown in Figure 5 for 
the KvAP potassium channel membrane protein (PDB 
ID: 10RS:C) [36]. Protein surface delineation, identifica- 
tion of residues with energies above the threshold, pre- 
dicted CE clusters, and the experimentally determined 
CE are shown in Figure 5a, b, c, and 5d, respectively. 

Results 

In this report, we present a new CE predictor system 
called CE-KEG that combine an energy function compu- 
tation for surface residues and the importance of 
occurred neighboring residue pairs on the antigen surface 
based on previously known CEs. To verify the perfor- 
mance of CE-KEG, we tested it with datasets of 247 anti- 
gen structures and 163 non-redundant protein structures 
that had been obtained from three benchmark datasets in 



conjunction with a 10-fold cross-validation assessment. 
The known CEs had been experimentally determined or 
computationally inferred prior to our study. For a query 
protein, we selected the best CE cluster form top three 
predicted candidate groups and calculated the number of 
true CE residues correctly predicted by our system to be 
epitope residues (TP), the number of non-CE residues 
incorrectly predicted to be epitope residues (FP), the 
number of non-CE residues correctly predicted not to be 
epitope residues (TN), and the number of true CE resi- 
dues incorrectly predicted as non-epitope residues (FN). 
The following parameters were calculated for each pre- 
diction using the TP, FP, TN, and FN values and were 
used to evaluate the relative weights of the energy func- 
tion and occurrence frequency used during the predic- 
tions: 

Sensitivity(SE) = TP [TP + FN] 
Specificity(SP) = TN [TN + FP] 
Positive Prediction Value (PPV) = TP -I- [TP + FP] 

Accuracy(ACC) = [TP + TN] + [TP + TN + FN + FP] 

Table 2 shows the predictions when the average energy 
function of CE residues located within a sphere of 8-A 
radius and the frequencies of occurrence for geometri- 
cally related residue pairs are combined with different 
weighting coefficients, whereas Table 3 shows the results 
when the energies of individual residues are considered. 
The results show that the performance is better when the 
average energy is used as compared with the energy of 
single residues are considered. However, both approaches 
yield a similar performance for sensitivity, specificity, 
positive prediction value, and accuracy. For sensitivity, 
the best average energy weighting coefficient is 10%, 
which is a consequence of the energy function having 
been applied prior to the CE-anchor-selection step. 
Therefore, the energy function of the residues will not 
have an obvious effect on the prediction results. In this 
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Figure 5 Example of predicted CE clusters and true CE. (A) Protein surface of KvAP potassium channel membrane protein (PDB ID: 10RS:C). 
(B) Surface seed residues possessing energies within the top 20%. (C) Top three predicted CEs for 10RS:C. Predicted CEs were obtained by 
filtering, region growing, and CE cluster ranking procedures. The filtering step removing neighboring residues located within 12 A according to 
the energy ranked seed. Region growing formulated the CE cluster from previous filtered seed residues to extend neighboring residues within 
10 A radius. CE clusters were ranking by calculating the combination of weighted CEI and Energy scores. (D) Experimentally determined CE 
residues. 



case, the initial parameter settings for new target antigen 
and the following 10-fold verification will apply with 
these trained combinations. 

To evaluate CE-KEG, we adopted a 10-fold cross-vali- 
dation test. The 247 antigens derived from the Disco- 
Tope, Epitome, and IEDB datasets and the 163 non- 
redundant antigens were tested as individual datasets. 
These datasets were randomly partitioned into 10 subsets 
respectively. Each partitioned subset was retained as the 
validation proteins for evaluating the prediction model, 
and the remaining 9 subsets were applied as training data 



for setting best default parameters. The cross-validation 
process is repeated for ten times and each of the ten sub- 
sets was applied exactly once as the validation subset. 
The final measurements were then obtained by taking 
average from individual ten prediction results. For the set 
of 247 antigens, the CE-KEG achieved an average sensi- 
tivity of 52.7%, an average specificity of 83.3%, an average 
positive prediction value of 29.7%, and an average accu- 
racy of 80.4%. For the set of non-redundant 163 antigens, 
the average sensitivity was 47.8%; the average specificity 
was 84.3%; the average positive prediction value was 
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Table 2 Average performance of the CE-KEG for using 
average energy function of local neighboring residues. 



Weighing Combinations SE SP PPV ACC 



0%EG+100% GAAP 


0.478 


0.831 


0.266 


0.796 


1 0%EG + 90% GAAP 


0.490 


0.831 


0.273 


0.797 


20%EG + 80% GAAP 


0.492 


0.831 


0.275 


0.797 


30%EG + 70% GAAP 


0.497 


0.831 


0.277 


0.798 


40%EG + 60% GAAP 


0.493 


0.832 


0.280 


0.799 


50%EG + 50% GAAP 


0.503 


0.834 


0.284 


0.801 


60%EG + 40% GAAP 


0.504 


0.834 


0.284 


0.801 


70%EG + 30% GAAP 


0.519 


0.839 


0.294 


0.808 


80%EG + 20% GAAP 


0.531 


0.840 


0.300 


0.811 


90%EG + 1 0% GAAP 


0.521 


0.839 


0.294 


0.809 


1 00%EG + 0% GAAP 


0.496 


0.837 


0.279 


0.805 



The performance used combinations of weighting coefficients for the average 
energy (EG) and frequency of geometrically related pairs of predicted CE 
residues (GAAP) within a 8-A radius sphere. The highest SE is denoted by a 
bold-italic face. 



29.9%; and the average accuracy was 80.7%. For these two 
datasets, the number of CE clusters assessed was three 
top predicted ones. 

Discussion and conclusion 

With the rapidly increasing number of solved protein 
structures, CE prediction has become a necessary tool pre- 
liminary to wet biomedical and immunological experi- 
ments. For the work reported herein, we developed and 
tested a novel workflow for CE prediction that combines 
surface rate, a knowledge-based energy function, and the 
geometrical relationships between surface residue pairs. 
Because certain existing CE prediction systems do not 
allow the user to evaluate the values of area under receiver 
operating characteristic curve (AUC) by altering the para- 
meter settings, an alternatively approximate evaluation of 
the AUC can be made using the average of the specificity 



Table 3 Average performance of the CE-KEG for energy 
function of single residue. 



Weighting Combinations 


SE 


SP 


PPV 


ACC 


0%EG+100% GAAP 


0.478 


0.831 


0.266 


0.796 


1 0%EG + 90% GAAP 


0.463 


0.827 


0.260 


0.790 


20%EG + 80% GAAP 


0.473 


0.827 


0.265 


0.791 


30%EG + 70% GAAP 


0.476 


0.828 


0.268 


0.792 


40%EG + 60% GAAP 


0.483 


0.832 


0.275 


0.796 


50%EG + 50% GAAP 


0.466 


0.831 


0.273 


0.795 


60%EG + 40% GAAP 


0.476 


0.833 


0.280 


0.797 


70%EG + 30% GAAP 


0.485 


0.832 


0.281 


0.797 


80%EG + 20% GAAP 


0.480 


0.830 


0.278 


0.796 


90%EG + 10% GAAP 


0.481 


0.831 


0.275 


0.797 


1 00%EG + 0% GAAP 


0.463 


0.830 


0.265 


0.795 



The performance used combinations of weighting coefficients for the energy 
(EG) of individual residues and the frequency of occurrence for geometrically 
related pairs (GAAP). The highest SE is denoted by a bold-italic face. 



and sensitivity [21]. For example, in comparison with the 
prediction performance of the DiscoTope system using 
the DiscoTope benchmark dataset (70 antigens), our 
workflow provides a better average specificity (83.2% vs. 
75%), and a better average sensitivity (62.0% vs. 47.3%). 
Hence, the AUC value (0.726) returned by CE-KEG is 
superior to that found for DiscoTope (0.612). To compare 
CE-KEG with PEPITO (BEPro) system, we used both the 
Epitome and DiscoTope datasets. The PEPITO system 
returning averaged AUC values of 0.683 and 0.753, respec- 
tively, which are comparable with AUC values of 0.655 
and 0.726, respectively returned by CE-KEG. The average 
number of predicted CEs by employing CE-KEG is 
approximately six with the most likely predicted CEs 
ranked at an average position of 2.9. This finding was why 
we included the top three CEs in our subsequent analysis. 
Because CE-KEG limits the distance when extending 
neighboring residues, it predicts CEs that contain a rela- 
tively small number of residues. Therefore, CE-KEG per- 
forms better than the other tested systems in terms of 
specificity; however, the sensitivity value is decreased. 
Future research could focus on the distributions of various 
physicochemical propensities for epitope and non-epitope 
surfaces such as the specific geometrical shapes of antigen 
surfaces, and the unique interactions between antigens 
and antibodies. Such information may facilitate the appro- 
priate selection of initial CE anchors and provide precise 
CE candidates for immunological studies. 
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